
ML Cheat Sheet


  1. Regression (Predicting a Number): Used when the output is a continuous value (e.g., price, temperature, stock value).
  2. Classification (Predicting a Category): Used when the output is a discrete label (e.g., Yes/No, Red/Blue/Green).
  3. Clustering (Finding Hidden Groups): Unsupervised learning; the data has no labels. The computer finds patterns on its own.
  4. Time Series (Predicting the Future): Used for data ordered by time (daily sales, hourly sensor readings).
  5. Recommendation Engines (Suggesting Items): Used to suggest items to users based on their past behavior or on similar users.

Quick Comparison Table

Technique | Goal | Type | Library (PySpark)
Linear Regression | Predict a number | Supervised | pyspark.ml.regression
Logistic Regression | Predict a category (0 or 1) | Supervised | pyspark.ml.classification
Naive Bayes | Classify text/labels | Supervised | pyspark.ml.classification
K-Means | Find hidden groups | Unsupervised | pyspark.ml.clustering
ARIMA/SARIMA | Forecast future time steps | Stats/Time Series | statsmodels (via Pandas UDF)
Random Forest | High-accuracy classification | Supervised | pyspark.ml.classification

The PySpark Workflow Pattern

In PySpark, almost every ML task follows this same 3-step pattern:

  1. VectorAssembler: Combine your feature columns into a single "features" vector column.
  2. Fit: Train the model: model = algorithm.fit(df).
  3. Transform: Make predictions: predictions = model.transform(new_df).

ML Frameworks: scikit-learn, TensorFlow, PyTorch, and More

scikit-learn (sklearn)

The standard library for classical ML on tabular data: regression, classification, clustering, and preprocessing, all behind a consistent fit/predict API.

TensorFlow (and Keras)

Google's deep learning framework; Keras is its high-level API for building and training neural networks, with strong tooling for production deployment at scale.

PyTorch

A deep learning framework favored in research for its dynamic computation graphs and Pythonic style; dominant in NLP and computer vision work.

Other ML Libraries

XGBoost/LightGBM (gradient boosting on tabular data), statsmodels (statistical models and time series), spaCy/NLTK (NLP preprocessing), and Prophet (time series forecasting).


When to Use What?

Framework Best For Not For
scikit-learn Tabular/classical ML, fast protos Deep learning, images
TensorFlow Deep learning, production, scale Small tabular problems
PyTorch Deep learning research, NLP, CV Simple tabular ML
XGBoost/LGBM Tabular, competitions, accuracy Deep learning, images
statsmodels Statistical analysis, time series Deep learning
spaCy/NLTK NLP preprocessing, pipelines Tabular, vision
Prophet Time series forecasting Classification

General Advice

Start with the simplest model that could work, establish a baseline, and only add complexity (deep learning, heavy tuning) when it measurably improves results.


Example: Keras Neural Network (TensorFlow)

from tensorflow import keras
from tensorflow.keras import layers

# Binary classifier: one hidden ReLU layer, sigmoid output for a 0/1 probability
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X.shape[1],)),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=32)
model.evaluate(X_test, y_test)  # returns [loss, accuracy] on the test set

Example: PyTorch Neural Network

import torch
import torch.nn as nn
import torch.optim as optim
class Net(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x
model = Net(X.shape[1])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())
# Training loop omitted for brevity
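The omitted loop might look like the following minimal sketch (synthetic data and the layer sizes are illustrative; the nn.Sequential mirrors the Net class above):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy data: 32 samples, 4 features, binary labels derived from the first feature
X_train = torch.randn(32, 4)
y_train = (X_train[:, 0] > 0).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters())

for epoch in range(10):
    optimizer.zero_grad()               # reset gradients from the previous step
    outputs = model(X_train)            # forward pass
    loss = criterion(outputs, y_train)  # binary cross-entropy
    loss.backward()                     # backpropagate
    optimizer.step()                    # update weights
```

Unlike Keras, PyTorch makes each step of the loop explicit, which is exactly what researchers value about it.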

Practical ML Model Cheat Sheet (Interview/Assessment Prep)

General ML Workflow (Tabular Data)

  1. Load Data: Read your data into a DataFrame.
  2. Define Features/Target: Select feature columns (X) and target column (y).
  3. Preprocess: Encode non-numeric data if needed (LabelEncoder, OneHotEncoder).
  4. Split: Use train_test_split for train/test sets.
  5. Fit: Train your model (fit on X_train, y_train).
  6. Score/Evaluate: Use score(), accuracy_score, or other metrics on X_test, y_test.
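The six steps above can be sketched end-to-end with scikit-learn (toy inline data; the column names are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load Data (inline here instead of reading a file)
df = pd.DataFrame({
    "age": [22, 35, 47, 52, 23, 44, 36, 29],
    "color": ["red", "blue", "blue", "red", "red", "blue", "red", "blue"],
    "bought": [0, 1, 1, 1, 0, 1, 1, 0],
})

# 2. Define Features/Target
X = df[["age", "color"]].copy()
y = df["bought"]

# 3. Preprocess: encode the non-numeric column
X["color"] = LabelEncoder().fit_transform(X["color"])

# 4. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 5. Fit
model = LogisticRegression()
model.fit(X_train, y_train)

# 6. Score/Evaluate
acc = accuracy_score(y_test, model.predict(X_test))
```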

Random Forest (RF) vs Gradient Boosting (GB)

Random Forest: An ensemble of decision trees trained in parallel on random subsets of the data and features (bagging); predictions are averaged or majority-voted. Robust to overfitting and needs little tuning.

Gradient Boosting: An ensemble of shallow trees trained sequentially, each one correcting the errors of the previous ones. Often more accurate than RF on tabular data, but slower to train and more sensitive to hyperparameters.

When to use: Reach for RF as a strong, low-tuning baseline; reach for GB (or XGBoost/LightGBM) when you need maximum accuracy and can afford tuning time.

Typical usage (scikit-learn):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)  # mean accuracy on the held-out set

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb.score(X_test, y_test)

Naive Bayes

What is it? A probabilistic classifier based on Bayes' theorem, with the "naive" assumption that features are independent of each other given the class.

Types: GaussianNB (continuous features), MultinomialNB (counts, e.g., word frequencies), BernoulliNB (binary features).

When to use: Text classification (spam filtering, sentiment) or any problem needing a fast, simple baseline on high-dimensional categorical data.

Typical usage (text):

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # bag-of-words count matrix

model = MultinomialNB()
model.fit(X, y)
model.score(X, y)  # note: training accuracy; use a held-out set for a real estimate

Metrics & Confusion Matrix

Confusion Matrix: A 2x2 table (for binary problems) counting True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1: 2 * (Precision * Recall) / (Precision + Recall)
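A quick check of the accuracy formula against scikit-learn (the labels below are made up):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# sklearn's binary confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = (tp + tn) / (tp + tn + fp + fn)
assert acc == accuracy_score(y_true, y_pred)  # the formula matches the library
```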


PCA vs LDA

PCA (Principal Component Analysis): Unsupervised dimensionality reduction. Finds the directions of maximum variance in the data, ignoring any class labels. Use it to compress features or visualize high-dimensional data.

LDA (Linear Discriminant Analysis): Supervised dimensionality reduction. Finds the directions that best separate the classes, so it needs labels, and it yields at most (number of classes - 1) components.
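A small sketch of the difference with scikit-learn (synthetic two-class data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two classes of 4-feature points, shifted apart
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

# PCA: unsupervised -- fit ignores y entirely
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised -- fit needs y; max components = n_classes - 1 = 1 here
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
```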


General Tips

Always evaluate on held-out data, set random seeds for reproducibility, and prefer simple interpretable models unless accuracy demands otherwise.


Example: Naive Bayes Text Classification

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love ML", "ML is great", "I hate spam", "spam is bad"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)
# Note: neither word below appears in the training vocabulary, so the input
# becomes an all-zero vector and the prediction falls back to the class priors
model.predict(vectorizer.transform(["payment plan"]))
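The same vectorizer + model pair can also be wrapped in a scikit-learn Pipeline, so one object handles both the text-to-counts step and the classification:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["I love ML", "ML is great", "I hate spam", "spam is bad"]
labels = [1, 1, 0, 0]

# One object that vectorizes raw strings and classifies them in a single call
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
pred = clf.predict(["ML is great"])
```

This avoids the common bug of forgetting to reuse the fitted vectorizer at prediction time.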

Model Selection Table

Model Use Case Data Type Pros Cons
Linear Regression Predict a number Numeric Simple, interpretable Only linear relationships
Logistic Regression Predict a category (0/1) Numeric/categorical Probabilities, fast Only linear boundaries
Naive Bayes Text/category classification Text/categorical Fast, works for text Strong independence assumption
Random Forest Classification/regression Tabular Robust, less overfitting Slower, less interpretable
Gradient Boosting Classification/regression Tabular High accuracy, flexible Slow, needs tuning
K-Means Clustering Numeric Unsupervised, simple Needs k, only spherical clusters
ARIMA/SARIMA Time series forecasting Time series Handles trends/seasonality Needs stationary data